Add hl.wait & AllGather Matmul example (via hl_ext helper). #189

Merged: joydddd merged 1 commit into main from joydddd/stack/5 on Jul 9, 2025

Conversation

@joydddd joydddd commented Jun 16, 2025

joydddd added a commit that referenced this pull request Jun 16, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd mentioned this pull request Jun 16, 2025
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 16, 2025
@joydddd joydddd changed the title from "Add hl.signal & hl.wait (ptx impl)." to "[WIP] Add hl.signal & hl.wait (ptx impl)." Jun 16, 2025
@joydddd joydddd changed the base branch from joydddd/stack/4 to main June 17, 2025 21:04
joydddd added a commit that referenced this pull request Jun 17, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd changed the title from "[WIP] Add hl.signal & hl.wait (ptx impl)." to "Add hl.signal & hl.wait (ptx impl)." Jun 17, 2025
@joydddd joydddd changed the title from "Add hl.signal & hl.wait (ptx impl)." to "Add tl.wait (ptx impl)" Jun 20, 2025
@joydddd joydddd changed the title from "Add tl.wait (ptx impl)" to "Add hl.wait (ptx impl)" Jun 20, 2025
joydddd commented Jun 24, 2025

All Gather Matmul Performance, 8xH100

examples/all_gather_matmul.py

| shape | dtype | nccl | torch_symm_mem | triton | helion | Speedup over nccl | Best Backend |
|---|---|---|---|---|---|---|---|
| (256, 6656, 4096) | torch.bfloat16 | 240.576 | 509.696 | 274.336 | 272.864 | 1.000 | nccl |
| (256, 6656, 8192) | torch.bfloat16 | 481.600 | 545.568 | 560.544 | 527.552 | 1.000 | nccl |
| (256, 6656, 16384) | torch.bfloat16 | 933.056 | 1309.664 | 1256.640 | 1197.248 | 1.000 | nccl |
| (256, 6656, 32768) | torch.bfloat16 | 1852.320 | 2292.416 | 2669.696 | 4431.936 | 1.000 | nccl |
| (512, 6656, 4096) | torch.bfloat16 | 1772.864 | 1603.072 | 1532.064 | 1721.760 | 1.157 | triton |
| (512, 6656, 8192) | torch.bfloat16 | 2296.832 | 1596.864 | 2039.968 | 1053.984 | 2.179 | helion |
| (512, 6656, 16384) | torch.bfloat16 | 2951.552 | 5618.752 | 4682.240 | 2415.680 | 1.222 | helion |
| (512, 6656, 32768) | torch.bfloat16 | 13161.760 | 6228.160 | 10848.480 | 12085.920 | 2.113 | torch_symm_mem |
| (1024, 6656, 4096) | torch.bfloat16 | 3556.832 | 2825.600 | 2733.440 | 3177.760 | 1.301 | triton |
| (1024, 6656, 8192) | torch.bfloat16 | 6641.632 | 3352.672 | 4088.736 | 3881.216 | 1.981 | torch_symm_mem |
| (1024, 6656, 16384) | torch.bfloat16 | 4735.712 | 10557.312 | 12627.840 | 14168.864 | 1.000 | nccl |
| (1024, 6656, 32768) | torch.bfloat16 | 25345.888 | 25859.425 | 27778.080 | 34090.591 | 1.000 | nccl |
| (2048, 6656, 4096) | torch.bfloat16 | 6714.240 | 5583.840 | 5324.352 | 6093.664 | 1.261 | triton |
| (2048, 6656, 8192) | torch.bfloat16 | 12866.528 | 10761.472 | 12755.168 | 16312.288 | 1.196 | torch_symm_mem |
| (2048, 6656, 16384) | torch.bfloat16 | 25339.424 | 28606.527 | 30244.703 | 37253.922 | 1.000 | nccl |
| (2048, 6656, 32768) | torch.bfloat16 | 54389.023 | 62389.664 | 48514.751 | 63082.783 | 1.121 | triton |
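For readers checking the last two columns: the numbers are consistent with "Best Backend" being the backend with the smallest time in the row, and "Speedup over nccl" being the nccl time divided by that smallest time. This is inferred from the figures, not stated in the PR, and the helper name `summarize_row` below is hypothetical. A minimal sketch:

```python
def summarize_row(times: dict[str, float]) -> tuple[float, str]:
    """Return (speedup over nccl, best backend) for one benchmark row.

    Assumes lower is better and that the speedup is nccl / best,
    which is what the reported columns appear to follow.
    """
    best = min(times, key=times.get)  # backend with the smallest time
    return round(times["nccl"] / times[best], 3), best


# Row (512, 6656, 8192) from the table above.
row = {
    "nccl": 2296.832,
    "torch_symm_mem": 1596.864,
    "triton": 2039.968,
    "helion": 1053.984,
}
speedup, best = summarize_row(row)
print(f"{speedup:.3f} {best}")  # 2.179 helion
```

When nccl itself is fastest, the ratio is nccl / nccl, which is why those rows report 1.000.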

joydddd added a commit that referenced this pull request Jun 24, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd changed the title from "Add hl.wait (ptx impl)" to "Add hl.wait & AllGather Matmul example (ptx impl)." Jun 24, 2025
joydddd added a commit that referenced this pull request Jun 25, 2025
stack-info: PR: #189, branch: joydddd/stack/5
joydddd added a commit that referenced this pull request Jun 25, 2025
stack-info: PR: #189, branch: joydddd/stack/5
joydddd added a commit that referenced this pull request Jun 26, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd changed the title from "Add hl.wait & AllGather Matmul example (ptx impl)." to "Add hl.wait & AllGather Matmul example (via hl_ext helper)." Jun 26, 2025
joydddd added a commit that referenced this pull request Jun 26, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd changed the base branch from main to joydddd/stack/6 June 26, 2025 23:48
@joydddd joydddd changed the base branch from joydddd/stack/6 to main June 27, 2025 00:45
joydddd added a commit that referenced this pull request Jun 27, 2025
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd changed the base branch from main to joydddd/stack/6 June 27, 2025 00:45
@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 7, 2025 20:32
@joydddd joydddd force-pushed the joydddd/stack/5 branch from 7d657df to 00db4d9 Compare July 7, 2025 20:33
@joydddd joydddd changed the base branch from main to joydddd/stack/9 July 7, 2025 20:33
@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 7, 2025 23:35
@joydddd joydddd force-pushed the joydddd/stack/5 branch from 00db4d9 to 6b4b621 Compare July 7, 2025 23:36
@joydddd joydddd changed the base branch from main to joydddd/stack/9 July 7, 2025 23:36
@joydddd joydddd marked this pull request as ready for review July 8, 2025 00:54
@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 8, 2025 00:56
@joydddd joydddd changed the base branch from main to joydddd/stack/9 July 8, 2025 00:56
@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 8, 2025 01:05
@joydddd joydddd force-pushed the joydddd/stack/5 branch from 6b4b621 to b2600cc Compare July 8, 2025 01:05
@joydddd joydddd changed the base branch from main to joydddd/stack/9 July 8, 2025 01:05
joydddd commented Jul 8, 2025

All Gather Matmul Performance on 8xH100

| shape | dtype | nccl | torch_symm_mem | triton | helion | Speedup over nccl | Best Backend |
|---|---|---|---|---|---|---|---|
| (128, 6656, 4096) | torch.bfloat16 | 143.552 | 280.288 | 188.000 | 180.192 | 1.000 | nccl |
| (128, 6656, 8192) | torch.bfloat16 | 281.056 | 637.248 | 378.048 | 362.944 | 1.000 | nccl |
| (128, 6656, 16384) | torch.bfloat16 | 503.264 | 1177.536 | 858.752 | 786.688 | 1.000 | nccl |
| (256, 6656, 4096) | torch.bfloat16 | 269.536 | 315.264 | 274.368 | 280.896 | 1.000 | nccl |
| (256, 6656, 8192) | torch.bfloat16 | 468.416 | 547.168 | 556.096 | 523.584 | 1.000 | nccl |
| (256, 6656, 16384) | torch.bfloat16 | 928.512 | 1348.352 | 1234.272 | 1174.496 | 1.000 | nccl |
| (512, 6656, 4096) | torch.bfloat16 | 509.440 | 465.088 | 528.608 | 534.848 | 1.095 | torch_symm_mem |
| (512, 6656, 8192) | torch.bfloat16 | 984.224 | 852.544 | 829.696 | 3149.984 | 1.186 | triton |
| (512, 6656, 16384) | torch.bfloat16 | 1962.752 | 5048.416 | 3132.448 | 4739.456 | 1.000 | nccl |
| (1024, 6656, 4096) | torch.bfloat16 | 1210.656 | 1143.072 | 1025.792 | 1875.936 | 1.180 | triton |
| (1024, 6656, 8192) | torch.bfloat16 | 3414.464 | 5403.360 | 5204.640 | 4760.512 | 1.000 | nccl |
| (1024, 6656, 16384) | torch.bfloat16 | 9585.600 | 10560.192 | 9702.112 | 11572.288 | 1.000 | nccl |

@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 8, 2025 01:17
@joydddd joydddd force-pushed the joydddd/stack/5 branch from b2600cc to 7637caf Compare July 8, 2025 01:17
@joydddd joydddd changed the base branch from main to joydddd/stack/9 July 8, 2025 01:17
"""
Wait for a signal before accessing the data tensor.

Args:
    signal_pad: The signal tensor to wait on

Review comment (Contributor):

Explain the format of this better.
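
To make the "wait for a signal before accessing the data" semantics in the docstring concrete, here is a CPU-side analogy in plain Python. It is not the Helion API and says nothing about the actual layout of the signal tensor: `signal_pad`, `producer`, and `consumer` below are hypothetical stand-ins, with one `threading.Event` per data chunk playing the role of one signal slot.

```python
import threading

NUM_CHUNKS = 4
data = [None] * NUM_CHUNKS
# Hypothetical stand-in for a signal tensor: one flag per data chunk.
signal_pad = [threading.Event() for _ in range(NUM_CHUNKS)]


def producer() -> None:
    for i in range(NUM_CHUNKS):
        data[i] = i * 10     # write the payload first
        signal_pad[i].set()  # then publish the signal for chunk i


def consumer(out: list) -> None:
    for i in range(NUM_CHUNKS):
        signal_pad[i].wait()  # block until chunk i has been signalled
        out.append(data[i])   # only now is it safe to read chunk i


results: list = []
consumer_thread = threading.Thread(target=consumer, args=(results,))
producer_thread = threading.Thread(target=producer)
consumer_thread.start()
producer_thread.start()
producer_thread.join()
consumer_thread.join()
print(results)  # [0, 10, 20, 30]
```

The ordering is the point: the producer writes the data before setting the flag, and the consumer checks the flag before reading, so the consumer never observes a chunk that has not been written yet.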

@joydddd joydddd force-pushed the joydddd/stack/9 branch from 30d6f68 to 0578270 Compare July 8, 2025 19:02
@joydddd joydddd changed the base branch from joydddd/stack/9 to main July 8, 2025 19:07
@joydddd joydddd force-pushed the joydddd/stack/5 branch from 7637caf to c864ea5 Compare July 8, 2025 19:07
@joydddd joydddd force-pushed the joydddd/stack/5 branch 2 times, most recently from 90baafc to 3044010 Compare July 8, 2025 20:29
@joydddd joydddd requested review from drisspg and jansel July 8, 2025 21:08
stack-info: PR: #189, branch: joydddd/stack/5
@joydddd joydddd force-pushed the joydddd/stack/5 branch from 3044010 to cc61ed5 Compare July 9, 2025 18:20
@joydddd joydddd merged commit bd0f27a into main Jul 9, 2025
6 checks passed
3 participants